# Unsupervised Learning Model

# **Data Processing**

```python
# ! pip install gdown
import gdown

# Google Drive file ID (from the link)
file_id = "1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ"
url = f"https://drive.google.com/uc?id={file_id}"

# Download the file
gdown.download(url, output="lightcast_job_postings.csv", quiet=False)
```
```
Downloading...
From (original): https://drive.google.com/uc?id=1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ
From (redirected): https://drive.google.com/uc?id=1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ&confirm=t&uuid=7d118db4-9ed0-489a-9198-14c89e95732f
To: /home/ubuntu/.ssh/ad688-employability-sp25A1-group1-6/lightcast_job_postings.csv
100%|██████████| 717M/717M [00:04<00:00, 147MB/s]
'lightcast_job_postings.csv'
```
**Text Preprocessing: Combine Job Title and Skills into a Single Field for TF-IDF**
- We combined the job title and skills into a single text field (`combined_text`) to create a richer, unified input for the TF-IDF vectorizer. This improves the quality of feature extraction by capturing more context about each job, enabling better clustering and analysis.
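The concatenation step itself is not shown above; a minimal sketch, assuming the title and skill columns named later in this section (`TITLE_CLEAN`, `SOFTWARE_SKILLS_NAME`, `SPECIALIZED_SKILLS_NAME`) hold strings, and using hypothetical toy values:

```python
import pandas as pd

# Toy frame standing in for the job-postings data (hypothetical values)
df = pd.DataFrame({
    "TITLE_CLEAN": ["data engineer", "business analyst"],
    "SOFTWARE_SKILLS_NAME": ["python sql", "excel"],
    "SPECIALIZED_SKILLS_NAME": ["etl pipelines", "forecasting"],
})

# Fill missing values, then join the three text fields into one string per row
text_cols = ["TITLE_CLEAN", "SOFTWARE_SKILLS_NAME", "SPECIALIZED_SKILLS_NAME"]
df["combined_text"] = df[text_cols].fillna("").agg(" ".join, axis=1)

print(df["combined_text"].tolist())
# ['data engineer python sql etl pipelines', 'business analyst excel forecasting']
```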
**Unique Value Counts in Job Title and Skill Fields**

```
Unique values in 'TITLE_CLEAN': 27266
Unique values in 'SOFTWARE_SKILLS_NAME': 22456
Unique values in 'SPECIALIZED_SKILLS_NAME': 41462
```
- Counting unique values helps assess the diversity and granularity of job titles and skill mentions before vectorization and clustering, which is important for understanding feature richness and potential noise in the dataset.
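The counts above can be produced with pandas' `Series.nunique`; a minimal sketch on a toy frame (values hypothetical):

```python
import pandas as pd

# Toy frame with duplicated values (hypothetical)
df = pd.DataFrame({
    "TITLE_CLEAN": ["data engineer", "data engineer", "business analyst"],
    "SOFTWARE_SKILLS_NAME": ["python", "sql", "python"],
})

# nunique() counts distinct non-null values per column
for col in ["TITLE_CLEAN", "SOFTWARE_SKILLS_NAME"]:
    print(f"Unique values in '{col}': {df[col].nunique()}")
# Unique values in 'TITLE_CLEAN': 2
# Unique values in 'SOFTWARE_SKILLS_NAME': 2
```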
# **K-Means Clustering**

**Text Vectorization and Feature Scaling for Clustering**
```python
#| eval: true
#| echo: false
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

# Vectorize the combined job text into TF-IDF features
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf.fit_transform(df['combined_text']).toarray()

# Scale features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_tfidf)
```

We performed text vectorization and feature scaling, which are essential preprocessing steps before clustering:
- `TfidfVectorizer` converts the cleaned job-and-skills text (`combined_text`) into a numeric matrix of word-importance (TF-IDF) weights, enabling text-based clustering.
- `StandardScaler` rescales the TF-IDF features to zero mean and unit variance, which is important because KMeans is sensitive to feature magnitudes.
**KMeans Clustering and Evaluation with NAICS 6-Digit Labels**
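The fitting step that produced the scores below is not shown in the source; a minimal sketch of the KMeans fit, using toy blob data in place of `X_scaled` and a hypothetical choice of 5 clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in for X_scaled: 200 points drawn from 5 blobs (hypothetical data)
X_scaled, _ = make_blobs(n_samples=200, centers=5, random_state=42)

# Fit KMeans and assign each row (job posting) a cluster id
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

print(clusters.shape)  # one cluster id per row
```

In the actual analysis, the resulting ids would be stored back on the frame (e.g. as the `cluster` column evaluated below).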
```
Adjusted Rand Index (NAICS_2022_6_NAME): 0.009
Normalized Mutual Info Score (NAICS_2022_6_NAME): 0.033
```
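For calibration: ARI is approximately 0 for a random labeling and 1 for a perfect match, so 0.009 is barely above chance. A quick sanity check on a toy partition:

```python
from sklearn.metrics import adjusted_rand_score

labels = [0, 0, 1, 1, 2, 2]

# Identical labelings agree perfectly
print(adjusted_rand_score(labels, labels))  # 1.0

# Renaming the groups does not matter: ARI compares partitions, not label names
print(adjusted_rand_score(labels, [2, 2, 0, 0, 1, 1]))  # 1.0
```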
**Evaluate Clustering Using Multiple Reference Labels (NAICS, SOC, ONET)**
```python
#| eval: true
#| echo: false
import pandas as pd
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

reference_labels = ['NAICS_2022_6_NAME', 'SOC_2021_5_NAME', 'ONET_NAME']
results = []
for label in reference_labels:
    # Drop rows missing either the reference label or the cluster assignment
    df_eval = df[[label, 'cluster']].dropna()
    ari = adjusted_rand_score(df_eval[label], df_eval['cluster'])
    nmi = normalized_mutual_info_score(df_eval[label], df_eval['cluster'])
    results.append({'Reference Label': label, 'ARI': ari, 'NMI': nmi})

pd.DataFrame(results)
```

| Reference Label | ARI | NMI |
|---|---|---|
| NAICS_2022_6_NAME | 0.0092 | 0.0331 |
| SOC_2021_5_NAME | 0.0000 | 0.0000 |
| ONET_NAME | 0.0000 | 0.0000 |
- NAICS_2022_6_NAME has the highest agreement with the clusters (though still very low), suggesting a slight alignment with industry-based classification.
- The SOC and ONET labels show essentially zero alignment, meaning the clusters derived from TF-IDF features of job titles and skills do not correspond to occupation-based taxonomies.
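Since the external labels agree so poorly, an internal metric such as the silhouette score (imported earlier but not used in this section) can gauge cluster cohesion without reference labels; a minimal sketch on hypothetical toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy features standing in for X_scaled
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)

# Silhouette ranges from -1 (poor) to +1 (dense, well-separated clusters)
score = silhouette_score(X, labels)
print(round(score, 3))
```

Comparing silhouette scores across candidate values of `n_clusters` is one common way to pick k when no taxonomy lines up with the data.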